(a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of the figure above. The numbers inside the boxes indicate the mean of Y within each region.
Answer
Decision Tree.
(b) Create a diagram similar to the left-hand panel of the figure, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
Answer
Decision Tree.
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
Answer
library(ISLR)
attach(Carseats)
set.seed(1)
train <- sample(dim(Carseats)[1],dim(Carseats)[1]/2) #train sample
Carseats.train <- Carseats[train,] #Creating Carseats.train with training data
Carseats.test <- Carseats[-train,] #Creating Carseats.test with test data
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test error rate do you obtain?
Answer
library(tree)
tree.carseats <- tree(Sales ~ ., data = Carseats,subset = train) # Fitting Regression Tree
summary(tree.carseats) #Summary the results
##
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats, subset = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "Income"
## [6] "CompPrice"
## Number of terminal nodes: 18
## Residual mean deviance: 2.36 = 429.5 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.2570 -1.0360 0.1024 0.0000 0.9301 3.9130
plot(tree.carseats) #Plot the tree
text(tree.carseats, pretty = 0)
pred.carseats <- predict(tree.carseats,Carseats.test) #predict
mean((Carseats.test$Sales - pred.carseats)^2) #MSE
## [1] 4.148897
sqrt(mean((Carseats.test$Sales - pred.carseats)^2))
## [1] 2.036884
We have produced a regression tree with Sales as the response and the other variables as predictors. The tree has actually used six variables: “ShelveLoc”, “Price”, “Age”, “Advertising”, “Income” and “CompPrice”. The most important variable for predicting Sales is ShelveLoc, with Price the second most important.
Our tree has eighteen (18) terminal nodes and its residual mean deviance is 2.36. A small residual mean deviance indicates a good fit to the training data. The test MSE of our tree is 4.1.
To interpret the results better we can calculate the RMSE, which is equal to 2.04. That means that when we make a prediction based on our model, we expect to miss by roughly 2 units of sales on average.
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test error rate?
Answer
set.seed(1)
#Using cv.tree() function to determine the optimal level of tree complexity
cv.carseats <- cv.tree(tree.carseats)
cv.carseats
## $size
## [1] 18 17 16 15 14 12 11 10 9 8 7 6 5 4 3 1
##
## $dev
## [1] 1127.510 1163.270 1163.270 1169.097 1169.097 1153.636 1129.237
## [8] 1139.639 1139.232 1109.839 1135.897 1126.203 1218.821 1205.007
## [15] 1331.927 1612.664
##
## $k
## [1] -Inf 15.48181 15.53599 18.69038 18.74886 21.05038 23.79480
## [8] 25.78579 26.01210 30.10435 32.74801 53.28569 72.33061 78.19599
## [15] 141.73781 251.22901
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
In this case, a tree of size 8 is selected by cross-validation, as it has the lowest deviance, equal to 1109.839.
We can also plot the results to get a better view of the relationship between size and deviance. We can easily observe that a tree size of 8 gives us the lowest cross-validation error, since its deviance is the lowest.
set.seed(1)
plot(cv.carseats$size,cv.carseats$dev,type='b')
# Pruning a Regression Tree , best size = 8
prune.carseats <- prune.tree(tree.carseats,best=8)
summary(prune.carseats) # summary the results with 8 terminal nodes
##
## Regression tree:
## snip.tree(tree = tree.carseats, nodes = c(39L, 11L, 8L, 18L,
## 7L, 10L))
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age"
## Number of terminal nodes: 8
## Residual mean deviance: 3.363 = 645.7 / 192
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.28400 -1.12000 0.02695 0.00000 0.99010 4.59800
plot(prune.carseats) # plot the prunned tree
text(prune.carseats, pretty = 0)
yhat.prune = predict(prune.carseats,newdata = Carseats.test)
mean((yhat.prune - Carseats.test$Sales)^2)
## [1] 5.09085
Using the pruned tree to make predictions on the test set, we estimate a higher test MSE (5.1) than with the unpruned tree. That would generally suggest that the pruned model underfits the data slightly.
It is worth mentioning that pruning a tree increases the bias, and if the accompanying reduction in variance is not large enough we may see an increase in the test MSE (as in our case). However, we gain interpretability, as the pruned tree is smaller.
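As a quick sketch (reusing `tree.carseats` and `Carseats.test` defined above, and not part of the original analysis), we could compute the test MSE for every candidate size in the pruning sequence, so the effect of pruning on the test error can be seen directly:

```r
# For each size, prune the tree and compute the test MSE.
# (prune.tree() may snap to the nearest attainable size in the sequence.)
sizes <- 2:18
test.mse <- sapply(sizes, function(s) {
  pruned <- prune.tree(tree.carseats, best = s)
  mean((predict(pruned, Carseats.test) - Carseats.test$Sales)^2)
})
plot(sizes, test.mse, type = "b", xlab = "Tree size", ylab = "Test MSE")
```

This makes it easy to see at which sizes pruning starts to hurt the test error.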
(d) Use the bagging approach in order to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important.
Answer
set.seed(1)
library(randomForest) #insert the library
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
bag.carseats <- randomForest(Sales ~ ., data = Carseats.train, mtry = 10, ntree = 500, importance = TRUE) # mtry sets the number of predictors sampled at each split; mtry = 10 means all 10 predictors are considered at every split, which corresponds to bagging
yhat.bag <- predict(bag.carseats, newdata = Carseats.test)
mean((yhat.bag - Carseats.test$Sales)^2)
## [1] 2.614642
We can see how the bagging approach decreases the test MSE to 2.6. We construct B regression trees, each grown on one of B bootstrapped training data sets, and then average the resulting predictions.
It is worth mentioning that these B trees are not pruned, so each individual tree has low bias but high variance; by averaging the trees we also achieve a lower variance. As a result we get a reduction in the expected test MSE, which decomposes into squared bias plus variance (plus irreducible error).
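To make the averaging idea concrete, here is an illustrative sketch of what bagging does under the hood (not the `randomForest` implementation itself; `tree()`'s default stopping rules differ from fully grown forest trees, so the MSE will not match exactly):

```r
# Manual bagging: grow B trees on bootstrap samples of the training data
# and average their predictions on the test set.
set.seed(1)
B <- 100
preds <- sapply(1:B, function(b) {
  boot <- sample(nrow(Carseats.train), replace = TRUE)    # bootstrap sample
  fit  <- tree(Sales ~ ., data = Carseats.train[boot, ])  # one tree per sample
  predict(fit, newdata = Carseats.test)
})
yhat.manual <- rowMeans(preds)               # average the B sets of predictions
mean((yhat.manual - Carseats.test$Sales)^2)  # manual bagging test MSE
```

The averaging step (`rowMeans`) is exactly where the variance reduction comes from.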
In addition, we can now use the importance() function to determine which variables are most important.
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 16.4714051 126.605047
## Income 4.0561872 78.821925
## Advertising 16.2730251 122.793232
## Population 0.7711188 62.796112
## Price 54.5571815 512.940454
## ShelveLoc 42.4486118 320.749734
## Age 20.5369414 184.804253
## Education 2.7755968 42.427788
## Urban -2.3962157 8.583232
## US 7.2258536 17.605661
We can conclude that “Price” and “ShelveLoc” are the most important variables for predicting Sales.
(e) Use random forests to analyze this data. What test error rate do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
Answer
set.seed(1)
rf.carseats <- randomForest(Sales ~ ., data = Carseats.train, mtry = 3, ntree = 500, importance = TRUE)
yhat.rf <- predict(rf.carseats , newdata = Carseats.test)
mean((yhat.rf - Carseats.test$Sales)^2)
## [1] 3.237463
Random forests follow the same idea as the bagging approach; however, each time we build a tree on a bootstrapped training sample, only m of the p predictors are considered at each split, typically m = \(\sqrt{p}\) for classification or m = \(p/3\) for regression (bagging corresponds to m = p). Averaging many highly correlated quantities does not give a large reduction in variance; to avoid that, random forests leave strong predictors out of some splits at random, which “de-correlates” the bagged trees and leads to a further reduction in variance.
In this case, a random forest with m ≈ \(p/3\) (the usual choice for regression trees) has a higher test MSE, equal to 3.2, than the bagging approach, where all the predictors (m = p) were available at every split of the B regression trees. That suggests the strong predictors are needed at most splits for the best predictions here. Even so, we still see a clear reduction in MSE compared to the single regression trees (pruned and unpruned) fitted earlier.
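To describe the effect of m more systematically, a sketch like the following (reusing the train/test split from above) refits the forest for each value of `mtry` and records the test MSE:

```r
# Test MSE as a function of m, the number of predictors tried at each split.
# m = 10 corresponds to bagging; smaller m de-correlates the trees.
set.seed(1)
m.values <- 1:10
mse.by.m <- sapply(m.values, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats.train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = Carseats.test) - Carseats.test$Sales)^2)
})
plot(m.values, mse.by.m, type = "b", xlab = "m (mtry)", ylab = "Test MSE")
```

The resulting curve shows directly how the error rate depends on m for this data set.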
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 8.0753055 122.68542
## Income 4.7774822 128.79339
## Advertising 12.6417009 139.29703
## Population 0.6016753 94.79730
## Price 37.6994519 382.76169
## ShelveLoc 32.8805652 239.26395
## Age 17.6866850 208.74153
## Education 0.4932141 66.98484
## Urban 0.5287570 16.29274
## US 7.8534356 31.39177
We notice that, in this case too, “Price” and “ShelveLoc” are the two most important variables for predicting Sales. It is worth mentioning that, as expected, the random forest de-correlates the trees, and the importance scores of Price and ShelveLoc drop to about 38 and 33, compared to about 55 and 42 under bagging.
This problem involves the OJ data set which is part of the ISLR package.
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
Answer
library(ISLR)
set.seed(1)
summary(OJ) #summary our data
## Purchase WeekofPurchase StoreID PriceCH PriceMM
## CH:653 Min. :227.0 Min. :1.00 Min. :1.690 Min. :1.690
## MM:417 1st Qu.:240.0 1st Qu.:2.00 1st Qu.:1.790 1st Qu.:1.990
## Median :257.0 Median :3.00 Median :1.860 Median :2.090
## Mean :254.4 Mean :3.96 Mean :1.867 Mean :2.085
## 3rd Qu.:268.0 3rd Qu.:7.00 3rd Qu.:1.990 3rd Qu.:2.180
## Max. :278.0 Max. :7.00 Max. :2.090 Max. :2.290
## DiscCH DiscMM SpecialCH SpecialMM
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.05186 Mean :0.1234 Mean :0.1477 Mean :0.1617
## 3rd Qu.:0.00000 3rd Qu.:0.2300 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :0.50000 Max. :0.8000 Max. :1.0000 Max. :1.0000
## LoyalCH SalePriceMM SalePriceCH PriceDiff
## Min. :0.000011 Min. :1.190 Min. :1.390 Min. :-0.6700
## 1st Qu.:0.325257 1st Qu.:1.690 1st Qu.:1.750 1st Qu.: 0.0000
## Median :0.600000 Median :2.090 Median :1.860 Median : 0.2300
## Mean :0.565782 Mean :1.962 Mean :1.816 Mean : 0.1465
## 3rd Qu.:0.850873 3rd Qu.:2.130 3rd Qu.:1.890 3rd Qu.: 0.3200
## Max. :0.999947 Max. :2.290 Max. :2.090 Max. : 0.6400
## Store7 PctDiscMM PctDiscCH ListPriceDiff
## No :714 Min. :0.0000 Min. :0.00000 Min. :0.000
## Yes:356 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.140
## Median :0.0000 Median :0.00000 Median :0.240
## Mean :0.0593 Mean :0.02731 Mean :0.218
## 3rd Qu.:0.1127 3rd Qu.:0.00000 3rd Qu.:0.300
## Max. :0.4020 Max. :0.25269 Max. :0.440
## STORE
## Min. :0.000
## 1st Qu.:0.000
## Median :2.000
## Mean :1.631
## 3rd Qu.:3.000
## Max. :4.000
train <- sample(dim(OJ)[1],800) #train sample with 800 observations
OJ.train <- OJ[train,] #create OJ.train
OJ.test <- OJ[-train,] #create OJ.test(remaining observations)
(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
Answer
library(tree)
oj.tree <- tree(Purchase ~ ., data = OJ.train,method="class") # Fit a tree to the training data with Purchase as the response and the other variables as predictors
summary(oj.tree) #summary our tree
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train, method = "class")
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7305 = 578.6 / 792
## Misclassification error rate: 0.165 = 132 / 800
We have produced a classification tree with Purchase as the response and the other variables as predictors. The tree has actually used four variables: “LoyalCH”, “PriceDiff”, “SpecialCH” and “ListPriceDiff”. Our tree has eight (8) terminal nodes, and its training error rate (misclassification error rate) is 0.165.
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
Answer
oj.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1064.00 CH ( 0.61750 0.38250 )
## 2) LoyalCH < 0.508643 350 409.30 MM ( 0.27143 0.72857 )
## 4) LoyalCH < 0.264232 166 122.10 MM ( 0.12048 0.87952 )
## 8) LoyalCH < 0.0356415 57 10.07 MM ( 0.01754 0.98246 ) *
## 9) LoyalCH > 0.0356415 109 100.90 MM ( 0.17431 0.82569 ) *
## 5) LoyalCH > 0.264232 184 248.80 MM ( 0.40761 0.59239 )
## 10) PriceDiff < 0.195 83 91.66 MM ( 0.24096 0.75904 )
## 20) SpecialCH < 0.5 70 60.89 MM ( 0.15714 0.84286 ) *
## 21) SpecialCH > 0.5 13 16.05 CH ( 0.69231 0.30769 ) *
## 11) PriceDiff > 0.195 101 139.20 CH ( 0.54455 0.45545 ) *
## 3) LoyalCH > 0.508643 450 318.10 CH ( 0.88667 0.11333 )
## 6) LoyalCH < 0.764572 172 188.90 CH ( 0.76163 0.23837 )
## 12) ListPriceDiff < 0.235 70 95.61 CH ( 0.57143 0.42857 ) *
## 13) ListPriceDiff > 0.235 102 69.76 CH ( 0.89216 0.10784 ) *
## 7) LoyalCH > 0.764572 278 86.14 CH ( 0.96403 0.03597 ) *
We will pick node 20) in order to interpret the information displayed.
First we see the name of the splitting variable (SpecialCH) and the split criterion, which in this case is SpecialCH < 0.5. Then we have the number of observations in that branch (70) and the deviance, 60.89. Next, MM is the predicted class of Purchase, chosen from the possible classes CH and MM. In addition, we get the fractions of observations in that branch that take the values CH (16%) and MM (84%), where the first fraction refers to CH and the second to MM. Lastly, the asterisk (*) indicates that the branch leads to a terminal node.
(d) Create a plot of the tree, and interpret the results.
Answer
plot(oj.tree)
text(oj.tree, pretty = 0)
We can conclude that “LoyalCH” is the most important variable in the tree. If LoyalCH is smaller than about 0.26 we predict MM, while if LoyalCH is greater than about 0.76 we predict CH. In the middle, where 0.26 < LoyalCH < 0.76, the decision also depends on PriceDiff, ListPriceDiff and SpecialCH. It is worth mentioning that MM is predicted at both terminal nodes of the leftmost split of the tree; that split is still performed because it increases node purity, even though it does not reduce the classification error.
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
Answer
oj.pred.test <- predict(oj.tree, OJ.test, type = "class")
table(oj.pred.test,OJ.test$Purchase)
##
## oj.pred.test CH MM
## CH 147 49
## MM 12 62
missclass <- sum(OJ.test$Purchase != oj.pred.test)
missclass/length(oj.pred.test) # (49+12)/(147+62+49+12) test error rate
## [1] 0.2259259
Test Error Rate is approximately 22.6 %.
(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
Answer
set.seed(1)
cv.oj <- cv.tree(oj.tree,FUN=prune.misclass)
cv.oj
## $size
## [1] 8 5 2 1
##
## $dev
## [1] 152 152 161 306
##
## $k
## [1] -Inf 0.000000 4.666667 160.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
We can see that the lowest cross-validation error occurs at size = 5 and size = 8, where the error equals 152. Since the error does not improve beyond size = 5, we should prune the tree to size = 5. By pruning the tree to size 5 we get not only a more interpretable model but also a more efficient one.
(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
Answer
The plot of size against deviance confirms that trees with 5 and 8 terminal nodes have the same (lowest) cross-validation deviance, as expected.
plot(cv.oj$size, cv.oj$dev, type = "b" , xlab ="Tree size", ylab = " Deviance")
(h) Which tree size corresponds to the lowest cross-validated classification error rate?
Answer
We can observe that trees of size 5 and size 8 have the lowest cross-validation error.
(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
Answer
oj.prune <- prune.misclass(oj.tree, best = 5)
plot(oj.prune)
text(oj.prune,pretty=0)
We can see that our pruned tree of size 5 uses three variables, LoyalCH, PriceDiff and SpecialCH, whereas the unpruned tree used four. Of course, LoyalCH remains the most important variable, with PriceDiff the second most important.
(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(oj.prune)
##
## Classification tree:
## snip.tree(tree = oj.tree, nodes = 3:4)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH"
## Number of terminal nodes: 5
## Residual mean deviance: 0.8256 = 656.4 / 795
## Misclassification error rate: 0.165 = 132 / 800
The training error rate of the pruned tree is 0.165, the same as the training error rate of the unpruned tree. As has already been mentioned, with a smaller tree we gain interpretability in our model, and it is more efficient because the tree is smaller.
(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
Answer
yhat.pruned <- predict(oj.prune, OJ.test, type = "class")
table(yhat.pruned,OJ.test$Purchase)
##
## yhat.pruned CH MM
## CH 147 49
## MM 12 62
missclass.pruned <- sum(OJ.test$Purchase != yhat.pruned)
missclass.pruned/length(yhat.pruned)
## [1] 0.2259259
In this case, the pruned tree has exactly the same test error rate, 22.6%, as the unpruned tree. Of course, this could change if we set different seeds when splitting the data into training and test sets, or at the cross-validation stage.
In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
Answer
library(ISLR)
Auto_b <- Auto
gas.median <- median(Auto$mpg)
bin.var <- ifelse(Auto$mpg > gas.median, 1 , 0) #create the binary variable
Auto_b$mpglevel = as.factor(bin.var)
Auto_b$mpg <-NULL #remove mpg as we have already created mpglevel
Auto_b$name <- NULL #remove name
(b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.
Answer
library(e1071)
set.seed(1)
tune.out <-tune(svm,mpglevel~.,data=Auto_b,kernel="linear",ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 100
##
## - best performance: 0.08679487
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.13544872 0.05715486
## 2 1e-02 0.09179487 0.05812971
## 3 1e-01 0.09948718 0.06331482
## 4 1e+00 0.09192308 0.05827672
## 5 5e+00 0.08935897 0.05443327
## 6 1e+01 0.08935897 0.05443327
## 7 1e+02 0.08679487 0.05567394
After comparing SVMs with a linear kernel over a range of values of the cost parameter, we find that cost = 100 gives the best performance, with a cross-validation error of 0.087. On the other hand, cost = 0.001 gives the highest cross-validation error, 0.135. To find the optimal parameter, the tune() function used 10-fold cross-validation by default and reported the error for each of the different cost values.
(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.
Answer
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto_b, kernel = "polynomial" , ranges = list(cost = c(0.001,0.01,0.1,1,5,10,100), degree = c(2,3,4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 5 3
##
## - best performance: 0.07910256
##
## - Detailed performance results:
## cost degree error dispersion
## 1 1e-03 2 0.56115385 0.04344202
## 2 1e-02 2 0.48455128 0.08604319
## 3 1e-01 2 0.28333333 0.08903669
## 4 1e+00 2 0.25775641 0.10016352
## 5 5e+00 2 0.18846154 0.07426805
## 6 1e+01 2 0.18846154 0.06917535
## 7 1e+02 2 0.16570513 0.07729330
## 8 1e-03 3 0.46711538 0.05079366
## 9 1e-02 3 0.26814103 0.09451019
## 10 1e-01 3 0.20429487 0.10736879
## 11 1e+00 3 0.08929487 0.06309033
## 12 5e+00 3 0.07910256 0.05600672
## 13 1e+01 3 0.07910256 0.05600672
## 14 1e+02 3 0.08416667 0.05540253
## 15 1e-03 4 0.52544872 0.06493340
## 16 1e-02 4 0.37788462 0.06752086
## 17 1e-01 4 0.27076923 0.09243581
## 18 1e+00 4 0.21935897 0.08632997
## 19 5e+00 4 0.19653846 0.10103932
## 20 1e+01 4 0.17339744 0.09552369
## 21 1e+02 4 0.15038462 0.07729445
For polynomial SVMs we get the best performance (lowest cross-validation error), 0.079, for cost = 5 and degree = 3. It is worth mentioning that the tune() function, again using 10-fold cross-validation by default, found exactly the same error for cost = 10 and degree = 3. Last but not least, the highest cross-validation error, 0.56, occurs for cost = 0.001 and degree = 2.
set.seed(1)
tune.out <- tune(svm, mpglevel ~ ., data = Auto_b, kernel = "radial" , ranges = list(cost = c(0.1,1,10,100,1000), gamma = c(0.5,1,2,3,4)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 1
##
## - best performance: 0.07391026
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-01 0.5 0.08923077 0.05559893
## 2 1e+00 0.5 0.07391026 0.05315495
## 3 1e+01 0.5 0.07897436 0.04567335
## 4 1e+02 0.5 0.10448718 0.06862934
## 5 1e+03 0.5 0.10448718 0.06072287
## 6 1e-01 1.0 0.08923077 0.06408390
## 7 1e+00 1.0 0.07391026 0.05176240
## 8 1e+01 1.0 0.08916667 0.05799733
## 9 1e+02 1.0 0.10698718 0.07386765
## 10 1e+03 1.0 0.10698718 0.07386765
## 11 1e-01 2.0 0.13512821 0.05778941
## 12 1e+00 2.0 0.07634615 0.05879482
## 13 1e+01 2.0 0.09423077 0.07220895
## 14 1e+02 2.0 0.09923077 0.07799865
## 15 1e+03 2.0 0.09923077 0.07799865
## 16 1e-01 3.0 0.30128205 0.11716574
## 17 1e+00 3.0 0.07634615 0.05867200
## 18 1e+01 3.0 0.09673077 0.06968463
## 19 1e+02 3.0 0.09923077 0.07316609
## 20 1e+03 3.0 0.09923077 0.07316609
## 21 1e-01 4.0 0.53820513 0.07278281
## 22 1e+00 4.0 0.08660256 0.05385501
## 23 1e+01 4.0 0.10179487 0.06661035
## 24 1e+02 4.0 0.10179487 0.06966782
## 25 1e+03 4.0 0.10179487 0.06966782
Finally, for the radial kernel, the lowest cross-validation error is 0.074, at cost = 1 and gamma = 1, which is the best performance among all the models (linear, polynomial, radial).
(d) Make some plots to back up your assertions in (b) and (c).
Answer
svm.linear1 <- svm(mpglevel ~ weight + acceleration, data = Auto_b, kernel = "linear", cost = 100) # linear SVM for mpglevel with weight and acceleration as predictors
plot(svm.linear1, Auto_b, weight ~ acceleration)
svm.poly <- svm(mpglevel ~ displacement + cylinders, data = Auto_b, kernel = "polynomial", cost = 5, degree = 3)
plot(svm.poly, Auto_b, displacement ~ cylinders) # polynomial SVM, plot displacement ~ cylinders
svm.radial2 <- svm(mpglevel ~ origin + horsepower, data = Auto_b, kernel = "radial", cost = 1, gamma = 1)
plot(svm.radial2, Auto_b, horsepower ~ origin) # radial SVM, plot horsepower ~ origin
svm.linear2 <- svm(mpglevel ~ weight + cylinders, data = Auto_b, kernel = "linear", cost = 100)
plot(svm.linear2, Auto_b, weight ~ cylinders) # linear SVM, plot weight ~ cylinders
svm.radial3 <- svm(mpglevel ~ acceleration + horsepower, data = Auto_b, kernel = "radial", cost = 1, gamma = 1)
plot(svm.radial3, Auto_b, horsepower ~ acceleration) # radial SVM, plot horsepower ~ acceleration
We can easily observe from the plots how, in general, the radial kernels separate our data in a more flexible way. This is what we expected to see, as the radial kernel already gave us the lowest cross-validation error.
It is also easy to observe in the plot below how the model can overfit the data if we raise the gamma parameter sharply. In that case we would of course expect a higher test error, and acceleration and horsepower would not predict very well whether a car gets high or low gas mileage (the pink and blue regions).
svm.radial4 <- svm(mpglevel ~ acceleration + horsepower, data = Auto_b ,kernel = "radial", cost = 1 , gamma = 100)
plot(svm.radial4,Auto_b,horsepower~acceleration)
Here we explore the maximal margin classifier on a toy data set. (a) We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class label. Sketch the observations.
Answer
x1 <- c(3,2,4,1,2,4,4)
x2 <- c(4,2,4,4,1,3,1)
colours = c("red","red","red","red","blue","blue","blue")
plot(x1,x2,col = colours , xlab = "X1", ylab = "X2" )
(b) Sketch the optimal separating hyperplane, and provide the equation for this hyperplane of the following form.
\(\beta_0 + \beta_1X_1 + \beta_2X_2 = 0\)
Answer
In our case the classes are linearly separable, and we fit the optimal separating hyperplane between the support vectors: observations 2, 3 (red) and 5, 6 (blue).
The first pair of observations is (2,2) and (2,1), and the second pair is (4,4) and (4,3). Taking the midpoint of each pair gives
=> (2,1.5) and (4,3.5), the two points through which we fit our hyperplane.
\(y = mx + b\)
Gradient(slope) \(m = (y_2 - y_1)/(x_2 - x_1)\)
In our case \(m = (3.5-1.5)/(4-2) = 1\)
so for m = 1 => y = x + b
We take one point of our line e.g. (2,1.5) we have 1.5 = 2 + b => b = -0.5
so for b = -0.5 and m = 1 our line is \(X_2 = X_1 - 0.5\), and our hyperplane is \(-0.5 + X_1 - X_2 = 0\)
plot(x1,x2,col = colours,xlim = c(0,5),ylim = c(0,5))
abline(-0.5,1)
(c) Describe the classification rule for the maximal margin classifier. It should be something along the lines of “Classify to Red if \(\beta_0 + \beta_1X_1 + \beta_2X_2 > 0\), and classify to Blue otherwise.” Provide the values for \(\beta_0, \beta_1\), and \(\beta_2\).
Answer
In general, we can draw an infinite number of hyperplanes that separate our data. Given a separating hyperplane, we can compute the perpendicular distance from each observation to it; the minimum of these distances is called the margin. The maximal margin hyperplane is therefore the separating hyperplane with the largest margin, i.e. the greatest possible minimum distance to the observations. Building the maximal margin hyperplane gives us the ability to classify a test observation based on which side of the hyperplane it lies. In our case the classification rule for the maximal margin classifier is: “Classify to Red if \(0.5 - X_1 + X_2 > 0\), and classify to Blue otherwise”, so \(\beta_0 = 0.5\), \(\beta_1 = -1\) and \(\beta_2 = 1\).
If we want to classify a test observation \((X_1, X_2)\), we apply this classification rule and see on which side of the hyperplane the observation lies.
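As a quick sanity check, we can encode the line \(X_2 = X_1 - 0.5\) (the `abline(-0.5, 1)` used above) as a classifier in R, with red above the line and blue below, and confirm that it reproduces the colours of all seven training points:

```r
# A point is red if it lies above the line X2 = X1 - 0.5, i.e. 0.5 - X1 + X2 > 0.
classify <- function(X1, X2) ifelse(0.5 - X1 + X2 > 0, "red", "blue")
all(classify(x1, x2) == colours)  # TRUE: every training point is classified correctly
```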
(d) On your sketch, indicate the margin for the maximal margin hyperplane.
Answer
plot(x1,x2,col=colours,xlim = c(0,5) , ylim = c(0,5))
abline(-0.5,1)
abline(-1,1,lty = 2) # blue support vectors (they are on the margin and support the maximal margin line)
abline(0,1,lty = 2) #red support vectors (they are on the margin and support the maximal margin line)
lines(x = c(2,1.9), y = c(1,1.39))
text(x = 2.2 , y = 1.4 , labels = "margin", cex = 0.8)
text(x = 2.4 , y = 2.1 , labels = "margin", cex = 0.8)
lines(x = c(2,2.1),y = c(2,1.60))
The margin is the perpendicular distance between the support vectors and the separating line (hyperplane). From the general formula, the distance from a point \((x_0,y_0)\) to the line \(\beta_0 + \beta_1X_1 + \beta_2X_2=0\) is:
\(d = \frac{\mid{x_0\beta_1 + y_0\beta_2 + \beta_0}\mid}{\sqrt{\beta_1^2 + \beta_2^2}}\)
so for \((2,2)\) and \(X_1 - X_2 - 0.5 = 0\), \(margin = \frac{\mid 2 - 2 - 0.5 \mid}{\sqrt{1^2 + (-1)^2}} = \frac{0.5}{\sqrt2} = \frac{\sqrt2}{4}\)
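We can verify this margin numerically in R, plugging the coefficients \(\beta_1 = 1\), \(\beta_2 = -1\), \(\beta_0 = -0.5\) of the line \(X_1 - X_2 - 0.5 = 0\) into the distance formula:

```r
# Perpendicular distance from the support vector (2,2) to the line X1 - X2 - 0.5 = 0
d <- abs(1*2 + (-1)*2 + (-0.5)) / sqrt(1^2 + (-1)^2)
all.equal(d, sqrt(2)/4)  # TRUE
```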
(e) Indicate the support vectors for the maximal margin classifier.
Answer
The support vectors are the points (2,1), (2,2), (4,3) and (4,4). These four observations lie exactly on the margin and “support” the maximal margin hyperplane: if any of them were moved, the maximal margin line would move as well.
plot(x1,x2,col=colours,xlim = c(0,5) , ylim = c(0,5))
abline(-0.5,1)
abline(-1,1,lty = 2)
abline(0,1,lty = 2)
points(2,2, pch=13,col="black",cex=2)
points(2,1, pch=13,col="black",cex=2)
points(4,4, pch=13,col="black",cex=2)
points(4,3, pch=13,col="black",cex=2)
(f) Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.
Answer
A slight movement of the seventh observation (4,1) will not affect the maximal margin hyperplane. As we said before, the maximal margin hyperplane depends only on the support vectors, which in our case are (2,1), (2,2), (4,3) and (4,4). Movements of the other observations will not affect our hyperplane as long as they stay outside the boundary of the margin.
(g) Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.
Answer
We can choose a line that separates the two classes without being the optimal separating hyperplane.
plot(x1,x2,col=colours,xlim = c(0,5) , ylim = c(0,5))
abline(-0.55,1.1)
Then our new hyperplane (corresponding to abline(-0.55, 1.1), i.e. \(X_2 = 1.1X_1 - 0.55\)) has the equation:
\(-0.55 + 1.1X_1 - X_2 = 0\) (not the optimal hyperplane)
(h) Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.
Answer
Let us add, for example, an 8th observation: the point (4,2), with class “red”. Then our plot will be:
plot(x1,x2,col=colours,xlim = c(0,5) , ylim = c(0,5))
points(4,2,col="red")
text(x=4.5,y=2,labels = "8th observation",cex=0.7)
We can easily observe that there is no longer a linear separation between the two classes, and we would have to examine different approaches (e.g. polynomial or radial kernels) to separate the classes successfully.
Consider the USArrests data. We will now perform hierarchical clustering on the states.
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
Answer
library(ISLR)
set.seed(1)
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
hc.complete <- hclust(dist(USArrests),method="complete")
plot(hc.complete, main = "Complete Linkage",xlab = "",ylab="",cex=0.9)
The hclust() function implements hierarchical clustering; here we used the “complete” linkage method on the USArrests data set.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
Answer
First, we will use the cutree() function with 3 as an argument in order to cut our tree into three distinct clusters:
cutree(hc.complete,3)
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
From the output above we can see exactly which states belong to which cluster. For example, Michigan belongs to the first cluster, Pennsylvania to the third, and New Jersey to the second. With the table() function we can also see that sixteen states belong to the first cluster, fourteen to the second, and twenty to the third.
table(cutree(hc.complete,3))
##
## 1 2 3
## 16 14 20
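If we want the states listed by name within each cluster, rather than scanning the long output above, one option is to split the state names (which cutree() keeps as names of its result) by cluster label:
clusters <- cutree(hc.complete,3)
split(names(clusters), clusters) # one vector of state names per cluster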
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
Answer
xsc <- scale(USArrests) #Used to scale the variables before performing Hierarchically cluster
hc.scomplete <- hclust(dist(xsc), method = "complete")
plot(hc.scomplete,main = "US Arrests (scaled)")
Because the variables are scaled to have standard deviation one, each variable affects the hierarchical clustering equally, and the same importance is given to each of them.
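We can check the effect of scale() directly: the raw variables have very different standard deviations, while every scaled column has standard deviation one.
apply(USArrests, 2, sd) # raw standard deviations differ widely (Assault's is by far the largest)
apply(xsc, 2, sd) # all equal to 1 after scaling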
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
Answer
cutree(hc.scomplete,3)
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
table(cutree(hc.scomplete,3))
##
## 1 2 3
## 8 11 31
Scaling the variables affects not only the heights in our dendrogram but also the cluster memberships: after scaling, the first cluster has eight states instead of sixteen, the second eleven instead of fourteen, and the third thirty-one instead of twenty.
Our data set contains arrest statistics per 100,000 residents for Assault, Murder, and Rape in each of the 50 US states, whereas UrbanPop is the percentage of the population living in urban areas. In my opinion, we should scale the data, chiefly because the variables are measured on different scales; otherwise the choice of units (per 100,000 versus percent) would greatly affect the dissimilarity measure obtained.
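A cross-tabulation of the two three-cluster solutions shows concretely how the memberships change after scaling:
table(unscaled = cutree(hc.complete,3), scaled = cutree(hc.scomplete,3))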
In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
Answer
set.seed(1)
data <- matrix(nrow=60, ncol=50)
# For every column we split the rows into three blocks of 20 observations and draw from normal distributions with different means (6, 0 and -6), creating three distinct classes
for(i in 1:ncol(data)) {
mean <- sample(1:100, size = 1) # drawn per column but not actually used below; the class means are fixed at 6, 0 and -6
data[1:20,i] <- rnorm(20,mean=6)
data[21:40,i]<- rnorm(20, mean=0)
data[41:60,i]<- rnorm(20, mean=-6)
}
plot(data)
(b) Perform PCA on the 60 observations and plot the first two principal components' eigenvector. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component eigenvectors.
Answer
First, we create data_pca, which holds the PCA of the 60 observations, and then we plot the first two principal components with different colours in order to indicate our three different classes.
data_pca <- prcomp(data, scale = TRUE)
plot(data_pca$x[,1],data_pca$x[,2],col=c(rep("green",20),rep("red",20),rep("purple",20)),pch = 20, xlab = "First Principal Component",ylab = "Second Principal Component ")
(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
Answer
We performed the K-means algorithm with K = 3 and nstart = 20. K-means randomly assigns initial class labels and computes the cluster centroids; it then assigns each observation to the closest centroid and repeats until the assignments stop changing. The nstart = 20 argument runs this whole procedure from 20 random initial assignments and keeps the best solution (the one with the lowest within-cluster sum of squares).
From the table, we can see that K-means successfully divided our observations into three subgroups of 20 observations each, matching our simulated data set.
set.seed(1)
km.out <- kmeans(data, 3, nstart = 20)
km.out$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##
## 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
(d) Perform K-means clustering with K = 2. Describe your results.
set.seed(1)
km.out <- kmeans(data, 2 , nstart = 20)
km.out$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##
## 1 2 3
## 1 20 20 0
## 2 0 0 20
In this case, the K-means algorithm with K = 2 forced our simulated data into two clusters, so the second true class (observations 21-40) has been absorbed and we are left with only two groups. More specifically, the first cluster has absorbed all the points of class “2”, and from the table we can see it contains 40 observations, while class “3” keeps its initial 20 observations.
This is clearer in the plot below (cluster “1” contains observations 1-40, cluster “2” contains observations 41-60).
plot(km.out$cluster)
(e) Now perform K-means clustering with K = 4, and describe your results.
set.seed(1)
km.out <- kmeans(data, 4 , nstart = 20)
km.out$cluster
## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 3 3 1 3 3 1 3 3 1 1 1 3 1 1 3 1 3 1 1 3
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##
## 1 2 3
## 1 0 0 10
## 2 0 20 0
## 3 0 0 10
## 4 20 0 0
In this case, the K-means algorithm with K = 4 separated our simulated data into four clusters (with arbitrarily assigned labels “1”, “2”, “3” and “4”). We can observe that our first two true classes each kept their 20 observations (clusters “4” and “2” respectively), while the third class has been split into two different clusters, “3” and “1”, of 10 observations each.
From the graph below, we can also see how our third class (observations 41-60) has been divided into the two new clusters “3” and “1”.
plot(km.out$cluster)
(f) Now perform K-means clustering with K = 3 on the first two principal components, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component's corresponding eigenvector, and the second column is the second principal component's corresponding eigenvector. Comment on the results.
Answer
summary(data_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 6.9270 0.38928 0.37229 0.3465 0.33571 0.32736
## Proportion of Variance 0.9597 0.00303 0.00277 0.0024 0.00225 0.00214
## Cumulative Proportion 0.9597 0.96269 0.96546 0.9679 0.97012 0.97226
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.31564 0.30591 0.29471 0.27916 0.2735 0.27046
## Proportion of Variance 0.00199 0.00187 0.00174 0.00156 0.0015 0.00146
## Cumulative Proportion 0.97425 0.97613 0.97786 0.97942 0.9809 0.98238
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.26307 0.26148 0.23994 0.23776 0.23237 0.22623
## Proportion of Variance 0.00138 0.00137 0.00115 0.00113 0.00108 0.00102
## Cumulative Proportion 0.98376 0.98513 0.98628 0.98741 0.98849 0.98952
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.22256 0.21446 0.20148 0.19804 0.19377 0.18047
## Proportion of Variance 0.00099 0.00092 0.00081 0.00078 0.00075 0.00065
## Cumulative Proportion 0.99051 0.99143 0.99224 0.99302 0.99378 0.99443
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.17743 0.16316 0.16040 0.14963 0.1414 0.13455
## Proportion of Variance 0.00063 0.00053 0.00051 0.00045 0.0004 0.00036
## Cumulative Proportion 0.99506 0.99559 0.99610 0.99655 0.9970 0.99731
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.12598 0.1230 0.11774 0.11283 0.10614 0.0999
## Proportion of Variance 0.00032 0.0003 0.00028 0.00025 0.00023 0.0002
## Cumulative Proportion 0.99763 0.9979 0.99821 0.99846 0.99869 0.9989
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.09753 0.09384 0.08487 0.08134 0.07672 0.06547
## Proportion of Variance 0.00019 0.00018 0.00014 0.00013 0.00012 0.00009
## Cumulative Proportion 0.99908 0.99926 0.99940 0.99953 0.99965 0.99974
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.05734 0.05104 0.04689 0.04547 0.03597 0.02874
## Proportion of Variance 0.00007 0.00005 0.00004 0.00004 0.00003 0.00002
## Cumulative Proportion 0.99980 0.99985 0.99990 0.99994 0.99996 0.99998
## PC49 PC50
## Standard deviation 0.02356 0.01966
## Proportion of Variance 0.00001 0.00001
## Cumulative Proportion 0.99999 1.00000
To begin with, we can observe that the first principal component explains 95.97% of the variance in the data, and the second principal component explains about 0.3%. The cumulative proportion of PC1 and PC2 is 96.3%, meaning that the first two principal components together explain around 96.3% of the variance in our simulated data. We can therefore expect that performing K-means with K = 3 on these two components will give the same separation into three clusters of 20 observations each.
We can also use a plot in order to see the proportion of variance explained in our simulated data set by each principal component.
plot(data_pca,type="l")
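The proportion of variance explained (PVE) reported by summary() can also be computed directly from the component standard deviations:
pve <- data_pca$sdev^2 / sum(data_pca$sdev^2) # variance of each PC divided by the total variance
pve[1:2] # PVE of the first two components
cumsum(pve)[1:2] # cumulative PVE of PC1 and of PC1+PC2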
set.seed(1)
km.out <- kmeans(data_pca$x[,1:2], centers = 3, nstart=20)
summary(km.out$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 2 2 3 3
km.out$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
As we expected, performing the K-means algorithm with K = 3 on the first two principal components separates our simulated data into three clusters “1”, “2” and “3” of 20 observations each, just as when we applied K-means to the whole data set (50 variables on 60 observations). We can also see this clearly from the table below.
table(km.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##
## 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
(g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
set.seed(1)
data.scaled <- scale(data, scale = TRUE)
kmscaled.out <- kmeans(data.scaled,centers=3)
kmscaled.out$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
First we scale our data so that each variable has standard deviation one, which is the default behaviour of the scale() function. After that, we run K-means clustering with K = 3 on the scaled data. The results are the same as those we obtained in (b): three clusters “1”, “2” and “3”, each containing 20 observations. We can see it clearly from the table below.
table(kmscaled.out$cluster, c(rep(1,20), rep(2,20), rep(3,20)))
##
## 1 2 3
## 1 20 0 0
## 2 0 20 0
## 3 0 0 20
Of course, this result is what we expected, as we generated all variables in the same type of units and with the same order of magnitude; scaling therefore does not affect the result.
However, with real-life data scaling is recommended, as we often have variables measured in different types of units and with different orders of magnitude. Without scaling, some variables would appear more important than they actually are, which would have a negative effect on our analysis.